Load Necessary Libraries
library(DBI)
library(readr)
library(RSQLite)
library(dplyr)
library(stringr)
This project simulates a real-world e-commerce data environment, including all stages of data management, from database configuration to data analysis and reporting. The database is created and managed with SQLite while GitHub Actions are used to automate processes enhancing workflow efficiency. In addition, Quarto is used in conjunction with R to conduct comprehensive data analysis and generate reports. Through this approach, technical skills and competencies necessary for effective data management are examined in-depth, mirroring the complexities and dynamics encountered in e-commerce.
Figure 1 shows the E-R diagram that simulates a real-world e-commerce data ecosystem, capturing the detailed relationships between entities and attributes essential for facilitating online transactions. In addition, it provides a comprehensive view of the e-commerce system, which serves as a platform for users to browse products, make purchases, and securely complete their payments.
The company only distributes products within the United Kingdom (UK).
The Currency used is Pound Sterling (GBP).
Attributes formats will be aligned with UK standard formats such as date , addresses , names …etc
This section describes and illustrates the entities in the above ERD and their respective attributes.
Shows us the users who have placed orders in the past. Attributes include name , email, phone number and address of the customer.
Vendors who supply products.This entity contains attributes of names, addresses, emails and status of whether the supplier is currently supplying product or not (Active/Inactive).
Describes all products in the stock and available for sale. Attributes include name, price, and availability of the products.
Emphasises all details related to placed orders. Some of its attributes includes billing and shipping addresses, type and status of payment, date and status of the order.
Category is the broad classification of products that share common features or are intended for a similar purpose. A sub-category is a more specific grouping of products within a category based on finer distinctions or attributes.
The voucher number or offer code to be applied to eligible products. The amount of discount it offers as well as the status of the discount are the attributes included.
Contains Written comments and rating of product bought by customers, the likes recieved as well as the time stamp of when the review was written.
The model intentionally skips direct order management. Instead, it focuses on product management and customer interactions through reviews and payment methods.
Order Entity not considered in this ER design in order to follow best practices by not having to include orderId as part of product table which might affect the overall performance of DB retrieval.
This relation as can be seen in Figure 2 indicates that each order detail contains multiple products that have been bought by the customer.
Associative Attribute (Quantity: The number of units of the product ordered in this line item.)
This relation will be created when customers orders one or more products. They will be linked with a particular order detail indicating the order status throughout the process. One customer can be associated with multiple order details at any given time as illustrated in Figure 3. Moreover, it aids in tracking an order’s life cycle, allowing for updates, customer notifications, and management of the order fulfillment process.
Figure 4 specify that each product is classified under a specific Category where products can belong to only one category. This enables customers to browse products by category and helps retailers manage product listings more efficiently.
As reflected in Figure 5, category can have multiple subcategories. For example, “Home Appliances” is the parent category containing subcategories like Washing Machines, Food Processors, Dishwashers etc.
This relation is important for inventory management, reflecting the real-world scenario where a single supplier might provide various products, but each product is supplied by only one specific supplier Figure 6. The relation helps track inventory sources, manage supplier relationships, and ensure product availability.
This relation signifies that a product can be eligible for a promotional discount. In this context and for simplicity the relation representing one discount code or voucher that is valid to apply on multiple eligible products Figure 7.
A product’s reviews are provided by Customers reflecting feedback or evaluation for a specific product or service. One product can have multiple reviews over time as shown in Figure 8.
The table below describes the logical schema of our E-commerce database. Italic attributes signify foreign keys.
#Creating the db schema in SQL
con <- dbConnect(RSQLite::SQLite(), "ecommerce.db")
sql_file <- readLines("dbScript.sql")
for (sql_command in sql_file) {
if (sql_command!=""){
print(sql_command)
dbExecute(con,sql_command)
print("-------------DONE---------")
}
}
The First Normal Form (1NF) mandates atomic values and unique rows via a primary key without duplicate columns. The tables were verified to meet these criteria. Second Normal Form (2NF) is achieved when a table in 1NF has all non-key attributes fully dependent on the primary key, thus avoiding partial dependencies on composite keys which has also been achieved. Finally, the team conducted thorough checks to ensure all tables comply with 3NF requirements, as detailed in subsequent sections.
Looking at the table, we can identify the following dependencies:
Product_ID and Order_ID combined are representing composite key, and Quantity is functionally dependent on Product_ID and Order_ID. There are no transitive dependencies here.
Given that Cust_ID is a foreign key in this table and that there is a separate Customers table where Cust_ID serves as the primary key, it suggests that Cust_ID is not functionally dependent on any other attribute within this table. It is merely referencing the primary key of another table.
Therefore, the dependency involving Cust_ID does not violate 3rd Normal Form (3NF). The table appears to be in 3NF as all other attributes are functionally dependent only on the primary key (Order_ID)
Category_ID is the primary key and it uniquely identifies Category_Name.
Analyzing the functional dependencies, it is found that Discount_Code is the primary key and it uniquely identifies Discount_Amount and Discount_Status. Each Discount_Code corresponds to a specific Discount_Amount and Discount_Status, ensuring that the table adheres to the principles of 3rd Normal Form (3NF). There are also no transitive dependencies or non-key attributes determining other attributes within the table.
The table also contains foreign keys, namely Supplier_ID, Discount_Code, and Category_ID. These foreign keys are merely referencing the primary key of their respective tables and are not functionally dependent on any other attribute within this table. Examining the non key attributes’ functional dependencies, Product_ID emerges as the primary key, in which Product_Name, Product_Price and Product_Availability appears to be functionally dependent on it.
In the Reviews table, Review_ID acts as the primary key, ensuring uniqueness. Attributes like Review_Timestamp, Product_Rating, Review_Text and Review_Likes are dependent on Review_ID.
In the Suppliers table, Supplier_ID acts as the primary key. Attributes such as Supplier_Name, Supplier_Building_Name, Supplier_Building_Number, Supplier_Street_Name, Supplier_Zip_Code, Supplier_Email and Supplier_Status are dependent on Supplier_ID.
Given these functional dependencies where each attribute seems to be functionally dependent on the primary key, with no non-key attributes determining other, all tables appears to adhere to the principles of 3rd Normal Form (3NF).
After the agreement on the schema mentioned in the previous section, the team started to generate synthetic data that to some extent, imitated realistic e-commerce as much as possible.
ChatGPT has been used as the main tool for this step as an alternative to Mockaroo, as the former produces more structural and logical data than the latter. [Appendix 1,2]. For tables consisting of foreign keys, R code has been used for assignment based of logic set by the team Appendix 3.
To enhance data quality for e-commerce analysis and validate the csv data generated, we use R to perform 4 main checks as shown in Figure 9 [Refer data_validate.R]:
Once the data has gone through all the checks, the updated csv files will be used to import into SQL. Importing the data into SQL is also done in R [Refer data_update.R]. This method focused on refining datasets for accurate analysis by identifying and eliminating invalid entries and duplicates, thereby maintaining the dataset’s uniqueness and reliability before inserting into Data Base.
To efficiently collaborate on this assignment, we have uploaded the project to Github. This was done so that the team could track changes, follow progress, collaborate on bug fixes etc. Moving the project to Github also allowed for implementation of version control and continuous integration.
The pipeline generation process consists of two phases. In the first phase, github repository has been created\(^{[1]}\) followed by integrating the posit cloud project. Additionally, all teams members have been added as collaborators.
Second phase was the automating our workflow. Using Github Workflows, we were able to automate the process of validating data, updating data and data analysis. The workflow is triggered on pushing new data to the project. Once all 3 steps are completed successfully, the new files are committed and pushed to the project. Figures 10, 11, 12, 13 show how the db and analyses carried out is automatically updated after a workflow run
Information regarding product ratings, sales revenue, customer distribution, and order status was obtained through the establishment of a connection to an SQLite database and the execution of SQL queries. Subsequently, data visualisation techniques were used to identify the distribution and patterns within the data, offering insights into e-commerce operational dimensions.
After joining the “Order_Items”, “Discounts”, and “Products” tables, the derived value, which is the “Total Revenue”, is calculated by multiplying “Order_Item” quantity and “Product_Price” and deducting it with “Discount_Amount.
Upon completion of this project, valuable insights are gained regarding the challenges and methodologies associated with an e-commerce context, including the design of databases, analysis of data, and presentation of findings in a clear and impactful manner. Ultimately, this project serves as a reference for future e-commerce projects in particular and data-driven projects in general, providing valuable insight into the modern data management landscape.